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Abstract 

Graphics Processing Units (GPUs) are employed for a numerical determination of the analytic 
structure of two-point correlation functions of Quantum Field Theories. These functions are 
represented through integrals in d-dimensional Euclidean momentum space. Such integrals can 
in general not be solved analytically, and therefore one has to rely on numerical procedures to 
extract their analytic structures if needed. After describing the general outline of the correspond- 
ing algorithm we demonstrate the procedure by providing a completely worked-out example in 
four dimensions for which an exact solution exists. We resolve the analytic structure by highly 
parallel evaluation of the correlation functions momentum space integral in the complex plane. 
The (logarithmically) divergent integral is regularized by applying a BPHZ-like Taylor subtrac- 
tion to the integrand. We find perfect agreement with the exact solution. The fact that each 
point in the complex plane does not need any information from other points makes this a perfect 
candidate for GPU treatment. A significant gain in speed as compared to sequential execution 
is obtained. We also provide typical running times on several GPUs. 

Keywords: analytic structure, Green's function, complex integration, branch cut, GPU, 
CUDA Fortran 



1. Introduction 

In quantum field theory (QFT), the analytic structure of a correlation or Green function is 
tied to the physical interpretation of the propagating degree of freedom. Hereby, the analytic 
structure in the complex plane spanned by analytic continuation of the square of the external 
momentum is considered. (NB: We work throughout in Euclidean momentum space, i.e., we 
consider all integrals after a Wick rotation has been performed.) For example, a two-point 
function of a massive scalar particle features a single pole in the complex plane. In momentum 
space this pole occurs in the timelike momentum region at p 2 = —m 2 , with p being the external 
momentum and m the mass of the particle. Furthermore, if there are more than just one particle 
making up the correlator, a branch cut appears starting at a threshold p 2 = —Am 2 . Also the 
occurrence of bound states among the particles forming the correlator is evident within the 
correlator's analytic structure, since then additional poles exists below the two-particle threshold 
in this case. 

Also the property of positivity of a Green function is encoded in its analytic structure. 
(More precisely, in the Euclidean case, the analytic structure determines whether the property 
of reflection positivity of the correlator's Schwinger function is fulfilled.) Positivity violating 
Green functions do not possess a Kallen-Lehmann representation [H[2]- Their negative norm 
contributions do not allow for a probabilistic interpretation as demanded by a quantum theory. 
Positivity is a necessary condition for a certain state to be part of the physical spectrum of 
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asymptotic states. Correspondingly, the investigation of the analytic structure has been subject 
of many preceding studies, see e.g., refs. O 01 El [7] and references therein. 

The main reason for the development of this code is an investigation of exactly this kind 
[8]. There, the field-strength tensor of pure Yang-Mills (YM) theory is used to construct the 
correlator 

(F 2 (x)F 2 (0)) = J ^e W {ip-xW(p 2 ), (1) 

where &(p 2 ) is the corresponding (Euclidean) momentum space operator whose analytic struc- 
ture is to be investigated, and D is the number of space-time dimensions. Here we have already 
exploited the fact that due to Poincare invariance of YM theory ^{p 2 ) depends only on the 
length of the four- vector p, i.e., on p 2 . The YM field-strength tensor is given by 

F« v = d»A a v - %A% + gf abc A^Al, (2) 

and the square entering the correlator reads (using Einstein's sum convention) 

F 2 = F^F^. (3) 

YM theory accounts for gauge field dynamics only, thus only gluonic degrees of freedom are 
considered. The gauge fields entering the correlator ([!]) might form a bound state, since the 
non-Abelian character of YM-theory induces self-interactions among the gluons. Such bound 
states are called glueballs (see, e.g., ref. [9]). In a (hypothetical) exact calculation they will lead 
to poles in the expression Q. Since gluon propagators of different kinds are used as an input 
in ref. [8], we developed this code which performs the investigation of the analytic structure 
numerically. Note that such integral expressions cannot be solved in general with conventional 
methods. 

This article is intended to provide a step-by-step tutorial for performing such an investigation 
numerically, using the power of parallelism provided by Graphics Processing Units. The code 
has been developed following the F0RTRAN90 standard and has been extended to GPUs by using 
CUDA Fortran, provided by the Portland Group jlOl lllj . Since each evaluation point in the 
complex plane can be treated on its own, the problem is perfectly suitable for a GPU treatment. 

This article is organized as follows: Sect. [2] provides the guideline how to extract the ana- 
lytic structure of a correlator numerically, while Sect. [3] presents a short introduction to GPU 
computing using CUDA [12] to establish the terminology. The numerical implementation of the 
procedure is described in Sect. [4j A worked example which can be solved analytically serves as 
a test case in Sect.[5j where we follow and detail the procedure given in Sect. [2] to reproduce the 
exact results numerically. We conclude in Section [6j 

2. Step by Step to the solution 

Let us assume we have the momentum space operator 9?(p 2 ) corresponding to a certain two- 
point function in four-dimensional Euclidean space as a starting point. Note that the procedure 
can be applied to arbitrary dimensions by trivial modifications, but let us restrict to the physical 
four dimensions for simplicity. The generic structure of such an operator expressed through an 
integral over the internal momentum is 

^(P 2 )= / ^^(P 2 ,k 2 ,p-k), (4) 

where J^(p 2 , k 2 ,p ■ k) is the corresponding integrand which is a function of the squares of the 
external and internal momenta (p 2 and k 2 , respectively) , as well as of the scalar product between 
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the external and the internal momentum. The aim is to find the analytic structure of &(p 2 ) 
in the complex p 2 -plane numerically. In the following we provide the steps that one has to 
carry out to achieve this goal. All one has to provide to follow this procedure is an operator 
expressed in the form as denoted in eq. Q. The steps presented below are then used in the 
worked example in Sect. [5j Steps which are labeled with the attribute (A) are to be performed 
analytical, while steps carrying the attribute (N) are numerical steps. One step carries the 
attribute (A,N) because one can choose to work it out either analytical or numerically. In the 
worked example we show both possibilities. 

• STEP 1, (A): Express Q in hyperspherical coordinates 

For a subsequent numerical treatment it is convenient to transform the integral in eq. Q 
into hyperspherical coordinates: 

/• p2n ^oo pix pn 

d 4 k^ d<t> I dkk 3 / d9 1 sm 2 6 1 / de 2 sme 2 . (5) 
Jtr. 4 Jo Jo Jo Jo 

Applying another change of variables by introducing 

y = k 2 , -»■ dy = 2kdk, (6) 

9\ = arccos z, — > d6\ = dz, (7) 

V 1 — z 2 

02 = arccos w, — > dO? = dz, (8) 

V 1 — w 2 

as well as relabeling p 2 — > x after the transformation, we arrive at 

-i p2n poo pi pi 

d 4 k^t - d<p dyy / dzy/l - z 2 / dw. (9) 

Rewriting the integrand of Q in the new variables we see that the integrand depends on 
x, y and z only, 

,¥(p 2 ,k 2 ,\p\\k\ cos 6i) ^ ^{x,y,V^Vv z). (10) 
Thus we can integrate </> and w trivially: 

j roo rl 

= (2vr)3 J dyy J dzy/ 1 ~ z2 ^( x ' y> V*Vy z )- ( n ) 

STEP 2, (A): Regularization (if applicable) 

In four dimension an integral like expression Q typically diverges logarithmically. If this 
is the case one has to regularize the integral. This step can be skipped if the integral 
is already finite. Here we use the BPHZ-procedure [131 HH HS1 EH] to obtain a finite 
expression. The so-called superficial degree of divergence (SDD) can be determined by 
counting the powers of the inner momenta that appear in ^{x, y, y/x^/y z), as well as 
the powers of the inner momenta present due to the integral measure. Let the SDD = n. 
Then we can render the integral finite by replacing the integrand ,^(x,y, \fx^Jy z) with 
^sub{x, y, yfxy/y z), which is obtained by applying a Taylor-subtraction operator t n up to 
the order of n to the initial integrand, 



^sub(x, y, s/x^/y z) = (l- t n )^(x, y, z), (12) 

where t n is given by 



i=0 



X 



dx l 



(13) 

x=0 
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Performing this step, we arrive at 



I rA 2 rl _ 

^Sub(x) = j—^ J dyy J dzy/l - z 2 ,^ S ub(x,y, \fx~yjy z), (14) 

where upper integration limit A 2 , introduced purely for numerical reasons, assumes some 
large finite value. Note that the BPHZ procedure ensures cut-off independence, i.e., the 
limit lim^. 2 — > oo does not alter the value of the integral. 

STEP 3, (A,N): Analytic continuation 

As it has been pointed out in ref. [6], the z-integral induces a branch cut in the complex 
y-plane. This happens when one integrates over the (integrable) singularities of the z- 
integrand. For certain values of x, the branch cut in the y-plane might obstruct the 



contour of the y-integration along the positive real axis, as it is used in equation (14). If 
the integrand of the angular (z) integral is a well behaved function, and if there are no 
poles of the remaining integrand for the y-integral on the positive real y-axis, then there is 
nothing to do and this step can be skipped. Usually, the region in the complex a>plane for 
which the induced branch cut does not interfere with the y-contour along the positive real 
axis is rather small. To avoid troubles, one has to perform the angular integration (either 
analytical or numerically) and look at the results in the complex y-plane. This can also be 
achieved by finding the complex y- values for which the z-integrand for a given complex x 
becomes singular. In the usual case of the integrand being a fraction, this simply reduces 
to putting the denominator polynomial to zero and solve this equation with respect to 
(complex) y for a certain (complex) value of x. 

As soon as this troublesome regions in the complex x-plane have been identified, one 
has to deform the y-contour in the complex y-plane accordingly, avoiding the branch-cut 
present due to the angular integration, as well as poles which might be present due to the 
y-integrand itself. The most accurate form of the integral corresponding to @Sub( x ) 1S 

^Sub(x) = J dyy J dzy/l - z 2 ,^ S ub(x, y, Vx^ z), (15) 

with ^ being the deformed contour connecting with A 2 , while avoiding the cut and the 
poles. 

STEP 4, (N): Preparation 

Now we have done all the analytical work involved in this investigation. The first (numer- 
ical) step is to determine a (M x N) matrix X which holds the x values at which we want 



to evaluate ^sub(x) as given by equation (15), with the contour adjusted as necessary (see 



STEP 3). M denotes the number of points along the real axis of x, and N is the number 
of points along the imaginary axis of x. The limits for the complex x-values have to be 
chosen as desired. The region of interest in the complex x-plane is then discretized and 
mapped to a matrix X. This step is already performed on the GPU (see Section [4] for the 
implementation of the numerics) . 

STEP 5, (N): Evaluation of the integrals 

This is the final step in this procedure. It is also performed on the GPU, using its par- 



allelism capabilities. For each entry of the matrix X we have to evaluate equation (15), 
which we do by using non-adaptive quadrature rules for approximating the values of the 
integrals. On the GPU we perform these integrations in parallel for a (m x n) sub-matrix of 
X, where the maximal size of the sub-matrix depends on the architecture of the GPU. The 
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result is stored to a file, the subsequent graphics processing of the data is then performed 
by using Mathematica |17j . 

This concludes our generic discussion of the procedure. The numerical steps (STEP 3, 
STEP 4 and STEP 5) are discussed in detail in Sect, The whole procedure (STEP 1 - 
STEP 5) is demonstrated in great detail by the worked example in Sect. [5] 

3. Introduction to CUDA 

In this section we briefly review some features of CUDA in order to make this article self- 
contained. We only describe such features we will need in the following section, for a more 
detailed description the reader is referred to ref. [12] . 

Compute Unified Device Architecture (CUDA) is a general purpose parallel computing archi- 
tecture provided by NVIDIA [12J. CUDA-enabled devices can be addressed using C (with some 
extensions) as a programming language, but there are also other high-level languages available. 
This study has been performed by implementing the code using CUDA Fortran, provided by 
PGI |1CH [TT] . CUDA Fortran is a small set of extensions to Fortran that allows one to use the 
computing resources of CUDA-enabled devices. A brief review of the realm of CUDA is in order 
before we discuss the actual implementation of the steps described in the previous section. 

In contrast to CPUs, GPUs are devices specialized in heavy-duty parallel calculations, as 
they are demanded by their primary purpose, i.e., 3D rendering. A GPU features several SIMD 
multiprocessors. The CUDA programming model is organized as follows. The code controlling 
the calculation runs as a sequential code on a CPU called the host. The task of the host part is to 
control the flow of data to and from the GPU, which is called the device. Each CUDA program 
consist thus of a host part and a device part. A function that is executed on the GPU is called 
a kernel. CUDA possesses a 3-level thread hierarchy. The lowest level of the model is a thread. 
The threads are executed on a scalable array of multithreaded Streaming Processors (SMs). 
A multiprocessor uses the Single-Instruction Multiple- Thread (SIMT) architecture to perform 
the concurrent execution of the threads. The multiprocessor partitions the threads in groups 
of 32 threads, called warps, which are then scheduled and executed. One common instruction 
is executed within a warp at a given time. The next element in the hierarchy is called a block. 
Each block contains a certain number of threads, which should be a multiple of the warp-size for 
an efficient usage of the hardware. The maximum number of threads per block is limited and 
determined by the hardware in use. The highest level in the hierarchy is called the grid. The 
grid consists of blocks and is assigned to a device. On the device, the blocks within the grid are 
distributed among the SMs, where subsets of the blocks, the warps, are scheduled. The threads 
of the warp are then executed on the cores of the SM. Blocks and grids can be either 1, 2 or 
3-dimensional, which also depends on the hardware. Each thread within the hierarchy possesses 
a unique identification number, the thread-ID, which can be calculated from the position of the 
block in the grid and the position of the thread in the block. Besides the thread hierarchy, there 
exists a memory hierarchy on the device. The global device memory can be accessed by each 
thread, where each thread can read and write on this memory. However, it is not a small latency 
memory. In the so-called Fermi architecture, each SM possesses a on-chip memory of 64 KB. 
One can either configure the on-chip memory to use 48 KB for shared memory with 16 KB LI 
cache or to use 16 KB for shared memory with 48 KB LI cache. Fermi furthermore features a 
768 KB L2 cache which connects all SMs. 

A CUDA program is usually of the following form. Parts running on the host are labeled 
with the attribute (H), while device parts are labeled with (D). 

• Initialization (H) The program starts with a host part, where data is allocated in the 
host memory, as well as in the device memory. The grid- and blocksize and dimensions are 
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determined, variables are initialized and prepared for the parallel execution on the device. 
Data is transferred from the host to the device. 

• Kernel execution (D) One or more kernels are executed on the device by a kernel call 
of the host. Each kernel has to be called together with its grid- and blocksize, which also 
includes the information about their dimensionality. The blocks of the grid of the kernel 
are executed on the SMs of the device. 

• Finalization (H) Once the device executions have finished, the result, which still resides 
on the device, has to be transferred back to the host where it can be stored to file. Finalizing 
steps such as memory deallocation are performed and the program closes. 

This rather brief discussion covers only the very basics of the programing model, but should 
provide enough understanding to follow the discussion in the next section. 

4. Numerical Implementation 

On the basis of the short introduction to CUDA outlined in the previous section, we can 
now introduce the actual implementation of the numerical steps of Section [2} The centerpiece 
of course is the integration, which we perform by numerical quadrature. In particular we use 
Gauss-Legendre quadrature [18] for the radial (y) integral, and Tanh-Sinh quadrature (also 
known as double exponential quadrature) |19| for the angular (z) integral. We also implemented 
Gauss-Chebyshev quadratures [18] (by using both, polynomials of the first and second kind, 
which we use as convenient depending on the dimension the problem is located in) to have a 
check on the angular integration results. The code is organized as follows, where again we make 
use of the labels (H) for code running on the host and (D) for device code respectively. 

• Initialization (H) 

Several preparation steps are necessary. 

— Allocation 

Allocation on both, the host and the device is performed for the weights and nodes 
needed for the quadratures. Since it is necessary to integrate along contours which 
are running close to singular structures in the complex plane, we have to use a huge 
number of nodes to get a smooth picture in the end. Another option would have been 
to use adaptive strategies, but they tend to get stuck near singularities and are more 
complicated to implement. We achieved good results with this procedure. 

— Preparation 

The calculation of the weights and nodes that are to be stored in the arrays of size ngl 
for Gauss-Legendre and ndbl for Tanh-Sinh quadrature respectively, are calculated on 
the host. In well-behaved areas it is by far sufficient to use ngl = 64 and ndbl = 49. 
However, in regions where the contour comes close to the singular structures living 
in the complex y-plane, one only obtains good results with much more nodes. Note 
furthermore, that while we only need one array for the angular nodes, we need n 
arrays for the radial nodes, where n is the number of the sub-contours needed to 
form the contour ^ that connects with A 2 in the complex y-plane. Depending on 
the parametrization, we have n = 1 up to n = 5 in our worked example. Besides 
the arrays for the weights and nodes, we need two M x N matrices of the data- 
type complex. The matrix X will contain the complex x values at which we intend 
to perform the calculation, the matrix M is used to store the result in the end. 
Furthermore, the grid and block dimension is set. 
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— Transfer 

The data that has been calculated on the host is transferred to the device global 
memory. 

Kernel 1 (D): Discretize the region of evaluation 

This kernel corresponds to STEP 4 of Section [2j This kernel is used to fill the matrix 
X with the complex values of the region in the complex x-plane where &sub( x ) is to be 
evaluated. We used square matrices with 128 2 points. Usually, 128 2 points are absolutely 
sufficient to produce a reasonable picture of the area, for some purposes we also used 256 2 
and 512 2 points. We usually restrict the real and imaginary part of x to range between 
±5. The kernel is called with blocks of 16 x 16 in size, the grid in case of the 128 2 matrix 
is then formed by 8 x 8 blocks. Each thread, which is described by a unique tuple 
writes the complex number of x to the matrix element X(i,j), where it determines the 
complex number by calculating a homogeneous distribution of the M x N points within 
the given range. The result is a matrix which holds a discretized version of the region of 
interest in the complex x-plane. 

Kernel 2 (D): Evaluation 

This kernel corresponds to STEP 5 of Section [2} It is the center part of the calculation. 
The block and grid sizes are the same as for Kernel 1. Each thread within a block is 
assigned to exactly one value of the X-matrix and calculates the double integral of equation 



(15), where it has to choose the contour that has been assigned to the area the point of 
evaluation Xij £ X belongs to. In our worked example we had to distinguish 4 areas in 
the complex x-plane, so we had to use 4 different contours. The decision which contour is 
the right one can be made by a simple IF-THEN-ELSE statement. The result produced 
by each thread corresponding to a certain value of x is then stored in the result matrix 
M. Once this kernel has finished its execution, the host takes over again. 

• Finalization 

The host transfers the result stored in the matrix M from the device back to the host and 
produces a stream that stores the result, together with the complex x value in a file. The 
file-name is generated dynamically and consists of several parameters to identify the run. 

The whole procedure will be detailed in the worked example in the following section. 

5. A Worked Example 

The procedure as proposed in the step-by-step recipe of Sect. [2] is here detailed using as 
example the correlator given in eq. (32) of ref. [Tj- There exists an exact solution for this 
correlator which makes this example a perfect test-case for the numerics. The correlator is given 
by 

. _ . 2 \ f d k 1 1 . . 

^ ' = J (2ir) D ( p - k) 2 - iy/29 2 k 2 + iV20 2 ' 



In ref. [7] it has been shown that in four dimensions for 2v / 2# 2 = 1 the integral in ( 16 ) can be 
done analytically for the regularized expression yielding 

%ub{x) = ( 1 ~ y + ^ l ~ X arccos 0) ) • ( 17 ) 
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Our numerical result will be compared to it, however, it is useful to rescale (17) to get rid of the 
prefactor: 

%ub,rescaled(x) = f 1 - ^ + ^ - % arCCOs(x) ] . (18) 

\ 2x x I 

The numerical task starts with eq. (16) and continues with the steps outlined in Sect. [2] 

5.1. Step 1: Transform the integral in hyper spherical coordinates 

We investigate (167r 2 ) x [ eq. (16) ] in four dimensions with 2y / 2# 2 = 1. Step 1 demands to 
switch to hyperspherical coordinates, which gives according to eq. ( fTTj ) 

roo rl I 

^reimledix) = / dyv / dzy 1 — Z 2 — -. r-. (19) 

5.2. Step 2: Regularization 



We determine the superficial degree of divergence of the integral by investigating eq. (16). 
In four dimensions we have four powers of the inner momentum k in the numerator due to the 
integral measure. The denominator also produces a highest power of 4 in the momentum k. Thus 
the superficial degree of divergence is 0, which means that the integral diverges logarithmically. 
Following the procedure, we use eq. ( 12 ) with the definition ( 13 ) to get the regularized integrand, 
which is in our case 

&**(x,y,yfry/yz) ={l-t°)^(x,y,Vx^z) (20) 

1 111 
~ (x + y- 2^^yz - f ) (y + |) " (y - §) (y + §) 
-x + 2yfx^/yz 



x + y- 2^^/yz - |)(y 2 + |) 



Performing the cancellations in the prefactor of (|19j), as well as plugging eq. (20) into (14), the 
following equation remains, 



y s ub,rescaled{x) = ~[ dyy [ dz\J \ - Z 2 JC + 2 

k Jo J -i (x + y-2^c^yz- ^){y 2 + {) 



5.3. Step 3: Analytic continuation 

In this step we have to investigate the analytic structure arising in the complex y-plane due 
to the analytic continuation of x = p 2 appearing in the integrand of the angular integral. The 
integral is 




%ub,rescaled(x) = - I dy V 1 / dz\J \ - Z 2 + j- . (22) 

(x + y-2y/x^/yz- |) 



=A =B 



Clearly, term A induces two poles in the complex y-plane, appearing at y = ±|. The angular 
integral B produces a branch cut in the complex y plane once the outer momentum x = p 2 
has been continued to a complex value. The branch cut at a given value of x in the complex 
y-plane can be found analytically by finding the poles of the integrand of B. In order to obtain 
a parametrization, we take the denominator polynomial prior to the change of variables, i.e., 
we solve 

„2 , ;„2 - - - * 



jf + kf- 2pk cos Ox - - = (23) 
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with respect to k. The two (redundant) solutions can be written as 



f(p A) =pcos9 1 ±^-p 2 sm9 1 + - , 0<^i<7T, (24) 
where we named the complex function £ instead of k to stress that we use it as a parametrization 



for the branch cut. Since we solved eq. (23) with respect to k for convenience, the branch cut 
in the complex y = k 2 plane is then given by 



£O 2 ,0i)= ( Vp^cos0i±\l-p 2 sme 1 + ^ 1 , 0<#i<vr. (25) 




With the help of eq. (25), we can plot the branch cut in the complex y-plane for a given value 
of p 2 = x. We also performed this investigation numerically by simply solving integral B for 
a given value of p 2 = x and for complex y with Mathematica [T7], as well as by using CUDA- 
Fortran and a separate kernel to obtain and verify the information. Fig. [Tj shows the complex 
y-plane for some values of p 2 = x. In fact, there are not many points in the complex x-plane 
where the branch cut does not interfere with the original integration contour along the positive 
real y-axis. We calculated the region where no obstruction occurs numerically by using a kernel 
designed for this purpose. Fig. [2] shows the regions in the complex x-plane where the original 
contour remains unharmed. Since the branch cut changes its orientation, size and shape we 
have to divide the complex x-plane into regions within we can apply the same contour. Fig. 
[3] shows the regions we have chosen. There is nothing to do in region 1 , thus we proceed by 
investigating the other regions. In the following we show the parametrizations of the deformed 
contours for the regions 2, 3, 4 and 5. The Figs. |4j [5] and [6] show the contours obtained by these 
parametrizations . 

• Region 2 and 5 

- %5),i : < t < 1 
#(2,5),l(*) = 15texp{iarg(a;)} 

- #(2, 5), 2 • 1 — ^ — ^ 

V(2,5),2(t) = 15(2 - t) exp{iarg(x)} - (1 - t)A 2 

• Region 3 

- «3,i : < t < 1 
ff 3 ji(t) = tOAi 

- «s,2 : 1 < t < 2 

*3,2(*) = 0.1(sin((2 - t)ir) + i(cos((2 - t)n) + 5)) 

- %,3 : 2 < t < 3 

% i3 (t) = (3 - *)(0.6i) - (2 - t)13exp{iarg(x)} 

- #3,4 : 3 < t < 4 

tf 3A (t) = (4-£)13exp{iarg(x)} - (3 - t)(-20 + 18t) 

- #3,5 : 4 < t < 5 

%' 5 (t) = (4 - i)(-20 + 18i) - (3 - t)A 2 

• Region 4 

- #4,1 : < t < 1 
ff 4 !i(t) = -*0.4i 
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Figure 1: Snapshot of the complex y-plane after the angular integration for four different values of x — p 2 . 
blue curves given by eq. ( 25 I coincide with the rough numerical estimate for the angular integration which is 

At each point in complex y-space the branch cut looks different. 
Only in the case shown in the upper left corner we can 



expressed by the density-plot in the background. 
The original integration contour is shown by the arrow, 
keep the original contour. In all other cases we are blocked by the branch cut. The green dots represent the poles 
present due to term A in eq. (221. 
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Re(p 2 ) 



Figure 2: The parabola-shaped region on the right is the set of all complex appoints within the region of evaluation 
where the branch cut does not obstruct the integration contour along the positive real y-axis. In all other regions 
we have to deform the contour in an adequate way. 

- ^4,2 : 1 < t < 2 

%, 2 (t) = 0.1(sin((t - 2)tt - tt) + i(cos((t - 2)tt - vr) - 5)) 

- ^4,3 : 2 < t < 3 

<*f 4 ' 3 (i) = (3-t)(-0.6i) - (2-t)13exp{farg(x)} 

- ^4,4 : 3 < t < 4 

<T 4 ' 4 (t) = (4-f)13exp{iarg(x)} - (3-i)(-20- 18i) 

- ^4,5 : 4 < i < 5 

<^ 4 ' 5 (t) = (4 - 0(-20 - 18i) - (3 - t)A 2 

5.^. S'iep ^: Preparation 

In this step we have to choose the size of the matrix to which the complex rc-plane is mapped 
to. In this example we used M x N = 128 x 128 for the size of the matrix X, and we are 
interested in the area —5 < D\z x < 5, —5 < 3m x < 5. The matrix is filled with 128 2 points 
which are homogeneously distributed over the region we restricted this consideration to. The 
CUDA kernel achieving this is called with a block size of 16 x 16 threads and with 8x8 blocks 
forming the grid. Each thread in the grid, which can be uniquely addressed by a tuple of numbers 
operates only on the matrix element X(i,j). The indices i, j are running from 1 to 128. 
Thus, for example the thread identified by the tuple (34, 117) operates only on the matrix entry 
X(34,117). 

5. 5. Step 5: Evaluation of the integrals 

This is the last step of our program. The kernel corresponding to this step is launched with 
the same parameters as the one in the step before, that is we still use 16 x 16 blocks organized 
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Figure 3: We split the complex x-plane into 5 regions where we will apply different deformations of the integration 
contour. In region 1 we can keep the original contour along the positive real j/-axis. Furthermore, in the regions 2 
and 5 we can apply the same contour, such that we are left with 3 different parametrizations covering the regions 
2, 3, 4 and 5, and the original contour which we can keep in region 1. 




Figure 4: The left picture shows the contour as chosen for the regions 2 and 5. In the right picture there is no 
obstruction by the cut and one could in principle also use the original contour. 
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Figure 5: The left picture shows the contour as chosen for region 3, the right picture shows a close-up. One has 
to avoid the pole when deforming the contour. The contour is closed via the upper half-plane. 




Figure 6: The left picture shows the contour as chosen for region 4, the right picture shows a close-up. One has 
to avoid the pole when deforming the contour. The contour is closed via the lower half-plane. 
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Figure 7: The imaginary part of the solution of eq. ( 16 1 after regularization and rescaling. The numerical data 



(blue dots) perfectly agrees with the exact solution provided by the surface-plot. 





Intel Xeon CPU 


NVIDIA GeForce GPU 


Device 


X5650 (1 core) 


GTX 550 Ti 


GTX 480 


Tesla C2070 


Runtime 


252m 2s 


6m21s 


4m21s 


2m38s 


Speedup 


1 


» 39.6 


» 57.9 


« 95.5 



Table 1: Comparison of running time and speed for a certain set of parameters. We used a modern CPU and 
compared against three different CPUs. 



in a 8 x 8 grid. Each thread operates only on the matrix entry it is assigned to and performs the 
integrations according to its position in complex x-space. The decision which contour a certain 
thread has to choose is made by a simple IF-THEN-ELSE construction. 



5.6. Results 

Finally we can compare the results obtained by numerical integration compared to the exact 
solution given by eq. (18). The plots are shown in the Figs. [7] and [8j This concludes our worked 
example. 



5.7. Execution time 

We compared the total execution time of a cheap consumer graphics card and two high 
performance graphics cards to the time needed by the code to run on one core of a modern 
CPU. The speedup results are presented in Table [T] 



6. Conclusions 

To summarize, we presented the numerical determination of the analytic structure of correla- 
tion functions given by momentum space integrals exploiting the parallel computing capability of 
a GPU. This rather complicated numerical analysis can then be performed within minutes. We 
provided a step-by-step description for the procedure required to obtain a numerical solution, 
and we presented a worked example and compared to its exact solution. 

The independence of the points in the complex plane makes such an investigation a perfect 
candidate for GPU treatment, which pays off even more with increasing matrix-size and/or 
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Figure 8: The real part of the solution of eq. ( 16 1 after regularization and rescaling. The numerical data (blue 
dots) perfectly agrees with the exact solution provided by the surface-plot. 



increasing number of nodes. Note that there may be still potential in speeding up both, the 
procedure and the code, which we have not done so far. 

The development of this procedure has been the first step in an ongoing investigation [8] of 
the analytic structure of correlators with different expressions as input. Hereby several different 
integrals of the form as given in eq. (15) have been evaluated. The here presented example 



served as decisive test for the achieved accuracy. Given the general features of the problem we 
are confident that the algorithm presented here will be valuable also for other investigations. 
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